0.1 Preamble

I built this workflow to support an evolutionary study of sedges found in North America (genus Carex). For this study, I used distribution data downloaded from GBIF to infer climatic niche.

GBIF data is notoriously dirty. Luckily, resources exist for distribution data cleaning (e.g., CoordinateCleaner) in R. Additionally, third-party resources, like range map database projects, exist that can be leveraged to curate distribution datasets.

In this demo, I’ll be using a third-party database project called BIEN (Botanical Information and Ecology Network; http://bien.nceas.ucsb.edu/bien/biendata/bien-2/species-range-models/) to identify potentially incorrect GBIF records.

Note that my workflow may be overly complex. It includes ~400 species’ datasets all analysed at once. In theory, this work would be easier if performed on a per species basis.

0.2 Packages

I like to use the set of packages called “tidyverse” to improve the readability of my code. If you use R regularly, and haven’t heard of tidyverse, I encourage you to check it out.

I like to shunt the internals of my analysis into functions stored in .R files. But for this demo, I’ll explicitly include all utility functions in the notebook.

library(tidyverse) # I use this for more readable code
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
source("../src/data/clean_gbif.R") # Stores utility functions
## Loading required package: rgbif
## Loading required package: sp
## Loading required package: countrycode
## Loading required package: CoordinateCleaner
## Loading required package: BIEN
## Loading required package: RPostgreSQL
## Loading required package: DBI
## Type vignette("BIEN") to get started
## The BIEN database has been updated to version 4.1 (as of 31-10-2018)!  This version includes nearly twice as many occurrence records and improved validations.
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## Loading required package: gtools
## Loading required package: maptools
## Checking rgeos availability: TRUE
## Loading required package: sf
## Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

0.3 Distribution data

I’ve already harvested GBIF distribution data for all Carex species used in my study. This process takes some time. Let’s import the data and take a look.

# (N.B.: removed gbif_data[[134]] Carex heteroneura, a duplicate)
# saveRDS(gbif_data, "../data/raw/gbif_data_demo.RDS")
gbif_data <- readRDS("../data/raw/gbif_data_demo.RDS")
gbif_species_list <- names(gbif_data)

There are 456 species in the GBIF dataset.

num_records_gbif_data <- lapply(gbif_data, nrow) %>% unlist
hist(num_records_gbif_data, main = "Number of GBIF Occurrence Records per Species", xlab = "Number of Occurrence Records", col = "light blue", breaks = 20)

The median number of records is 282 and the average is 963.6644737.

We can see that there is something strange happening on the right side of this histogram. A large number of species has ~5 000 records. This was the cap I placed on the number of records to be downloaded from GBIF. But could so many species really have so many records in GBIF?

GBIF data comes with a column called “coordinateUncertaintyInMeters”. Does this column help us to clean our GBIF data? Well, it isn’t always present, and the units used in the data may have been misused:

coord_uncertainty_gbif_data <- lapply(gbif_data, function(x) x$coordinateUncertaintyInMeters) %>% unlist 
coord_uncertainty_present_gbif_data <- coord_uncertainty_gbif_data %>% is.na %>% table
barplot(coord_uncertainty_present_gbif_data, col = "light green", main = "Number of Occurrence Records with Associated Uncertainty Data", xlab = "Uncertainty Data Present")

hist(coord_uncertainty_gbif_data, col = "light green", main = "Occurrence Record Uncertainty Data", xlab = "Coordinate Uncertainty in Meters")

0.4 CoordinateCleaner

CoordinateCleaner is a very useful package for flagging potentially erroneous occurrence records. If you are interested, this is the function I ran on my data:

clean_gbif <- function(dat) {
  print(dat$species %>% unique)
  flag <- try(clean_coordinates(x = dat, lon = "decimalLongitude", lat = "decimalLatitude", 
    countries = "countryCode", species = "species", tests = c("capitals", "centroids", 
      "equal", "gbif", "institutions", "zeros", "duplicates", "seas", "validity")))  # most tests are on by default
  return(flag)
}

The clean_coordinates function is the workhorse of the package. You can choose which tests you’d like to run based on what you suspect may be wrong with your data (e.g., capitals = do your distribution data match the coordinates of a capital city, centroids = do your distribution data match the coordinates within a radius around country centroids).

I’ll import the results of running clean_coordinates and show you the results visually for a few species.

library(CoordinateCleaner)

coordinate_cleaner_flags <- readRDS("../data/raw/cc_flags_demo.RDS")
#setwd("../../visualization/interim/coordinate_cleaner/2019-06-06")
#lapply(flags, function(x) plot_flag(x, "pdf")) # plot cc results

Let’s see what the maps look like:

# Plot examples of Coordinate Cleaner results
# par(mfrow=c(2,5)) # debug
subset_for_plotting <- sample(gbif_species_list, 10)
# subset_for_plotting <- c("Carex aestivalis", "Carex occidentalis")
coordinate_cleaner_flags[subset_for_plotting] %>% lapply(., function(x) plot(x, lon = "decimalLongitude", lat = "decimalLatitude", details = TRUE))
## $`Carex preslii`

## 
## $`Carex brainerdii`

## 
## $`Carex gynodynama`

## 
## $`Carex loliacea`

## 
## $`Carex fissuricola`

## 
## $`Carex flava`

## 
## $`Carex mackenziei`

## 
## $`Carex edwardsiana`

## 
## $`Carex raynoldsii`

## 
## $`Carex barbarae`

What kinds of numbers do we get in terms of identified issues?

coordinate_cleaner_flags_isolated <- coordinate_cleaner_flags %>% lapply(., function(x) x[,13:ncol(x)]) # Flags are in columns 13 onwards
coordinate_cleaner_flags_isolated_summary <- coordinate_cleaner_flags_isolated %>% lapply(., summary)
coordinate_cleaner_flags_isolated_summary <- map2(coordinate_cleaner_flags_isolated_summary, lapply(coordinate_cleaner_flags, nrow), c) # Add number of records. map2 is a purrr convenience function to map a function to two lists

barplot_species <- coordinate_cleaner_flags_isolated_summary[subset_for_plotting]
for (i in 1:length(barplot_species)) {
  barplot(barplot_species[[i]], main = names(barplot_species)[i], col = "light coral")
}

The most common issue with GBIF data is the presence of duplicate records. It looks like in my data, I also have problems with records occurring in the sea.

0.5 Range maps

It was recommended to me by a reviewer on my paper to verify my distribution data against range maps. As a result, I opted to try out BIEN range maps because they can be downloaded within R.

Note that these maps are estimations. BIEN has produced these maps using Maxent. Only ranges for species with five or more validly georeference observations were estimated using Maxent. For species with fewer than five points, BIEN uses a variety of approaches, as outlined in this decision tree.

Other range maps exist that can be used in an analysis like this. For example, the IUCN provides polygons for mammals, amphibians, birds, reptils, freshwater & marine animals, etc.

0.5.1 Download BIEN range maps

library(BIEN)

# Download BIEN range maps using species names
get_range_using_name <- function(sp_name) {
  print(sp_name)
  range <- try(BIEN_ranges_load_species(species = sp_name)) # I wrapped the function in try to allow it to fail and continue pulling data for subsequent species
  return(range)
}

# ranges <- mapply(get_range_using_name, gbif_species_list)
# saveRDS(ranges, file = "../data/interim/bien_ranges_demo.RData")
ranges <- readRDS("../data/interim/bien_ranges_demo.RData")

0.5.2 Clean BIEN range maps dataset

0.5.2.1 Remove species with no BIEN range map

# Here I am looping through the downloaded range maps and removing failures.
remove_failed_species <- function(ranges) {
  failed <- c()
  for (i in 1:length(ranges)) {
    if (is.character(ranges[[i]])) {
      failed <- c(failed, i)
    } else {
      ranges[[i]]@data[, ] <- names(ranges)[i]  # Fix BIEN name. BIEN returns species names separated by an underscore. My species list is space separated.
    }
  }
  ranges <- ranges[-failed]
  return(ranges)
}

cleaned_bien_ranges <- remove_failed_species(ranges)

0.5.3 BIEN download statistics

# First, match species in GBIF dataset and cleaned BIEN dataset.
bien_species_list <- names(cleaned_bien_ranges) # Generate a species name list
missing_bien_maps <- gbif_species_list %>% .[!. %in% bien_species_list] # Which species in the GBIF list are NOT in the BIEN download list?
subsetted_gbif_data <- gbif_data[bien_species_list]

Out of 456 species in the GBIF dataset, we were successful in downloading maps for 427 species. Hooray! The species for which no BIEN map was available are:

print(missing_bien_maps)
##  [1] "Carex assiniboinensis"   "Carex brysonii"         
##  [3] "Carex calcifugens"       "Carex collinsii"        
##  [5] "Carex corrugata"         "Carex crus-corvi"       
##  [7] "Carex cumberlandensis"   "Carex decomposita"      
##  [9] "Carex diluta"            "Carex divulsa"          
## [11] "Carex formosa"           "Carex gholsonii"        
## [13] "Carex heleonastes"       "Carex hookerana"        
## [15] "Carex incurviformis"     "Carex krausei"          
## [17] "Carex melanostachya"     "Carex mendocinensis"    
## [19] "Carex merritt-fernaldii" "Carex mesochorea"       
## [21] "Carex novae-angliae"     "Carex oklahomensis"     
## [23] "Carex ozarkana"          "Carex pachystachya"     
## [25] "Carex panicea"           "Carex pendula"          
## [27] "Carex saximontana"       "Carex thornei"          
## [29] "Carex wootonii"

0.5.3.1 Remove small BIEN ranges

I noticed while inspecting my maps that some BIEN ranges are TINY, and as a result occurrence records never fall inside the range. So I’ll remove these data before proceeding.

get_polygon_area <- function(range) {
  poly <- sf::st_as_sf(range)  # calculate area of polygon using sf package
  poly$area <- st_area(poly)  #Take care of units
  area <- poly$area %>% as.numeric
  return(area)
}

areas <- lapply(cleaned_bien_ranges, get_polygon_area) %>% unlist # first, get areas of ranges to weed out bad data. The area of the polygon must be a certain size. (Some BIEN ranges were tiny and not realistic!)
cleaned_bien_ranges <- cleaned_bien_ranges[areas > 1e+09]

# Re-run species list and data subset
bien_species_list <- names(cleaned_bien_ranges)
subsetted_gbif_data <- gbif_data[bien_species_list]

After removing unreliably restricted BIEN range maps, 398 species remain.

0.6 Visualize range maps

This is what BIEN range maps look like! They are SpatialPolygonsDataFrame objects that contain the projection and extent of the polygon.

0.7 Compare GBIF data to BIEN range maps

Now it’s time to use these range maps to identify erroneous GBIF occurrence records. To do this, I leveraged the CoordinateCleaner function cc_iucn.

From the cc_iucn documentation:

“Removes or flags records outside of the provided natural range polygon, on a per species basis.”

Luckily, all the function needs is a SpatialPolygonsDataFrame (which we have!) and the data to be tested. You must specify which columns contain your lat and lon data.

# Compare distribution data to range data and return the results in a vector
range_comparison <- function(dat, range) {
  range_flags <- try(cc_iucn(x = dat, range = range, lon = "decimalLongitude", lat = "decimalLatitude", value = "flagged", buffer = 1))
  dat_range_flags <- cbind(dat, range_flags)
  return(dat_range_flags)
}

# gbif_bien_comparison <- map2(subsetted_gbif_data, cleaned_bien_ranges, range_comparison) # map2 is a purrr convenience function to map a function to two lists
# saveRDS(gbif_bien_comparison, file = "../data/interim/gbif_bien_comparison_demo.RData")
gbif_bien_comparison <- readRDS("../data/interim/gbif_bien_comparison_demo.RData")

0.7.1 Plot GBIF distribution data and BIEN range map comparison

See below for an idea of how this comparison looks.

Normally, I would save these maps to a PDF file and run the plotting function on my full dataset. Then, I would examine these PDF files externally, so that I can flip through them using my computer’s file browser (faster than R).

Examining your data manually is still good practice, and most data providers will recommend it.

0.7.2 Remove occurrences from outside of BIEN range maps

As described previously, I like to visually examine my results. This tends to reveal problems that are impossible to anticipate.

For demonstration purposes, I’ll go ahead and remove all occurrences outside of BIEN ranges without such verification:

gbif_bien_comparison_filtered <- lapply(gbif_bien_comparison, function(x) x[x$range_flags == TRUE,]) # the range_flags column I've appended above will equal TRUE when the coordinate is within the BIEN range.

0.8 Summary of results

Now it’s time to examine the results of our BIEN range cleaning regime. There are countless ways to visualize results. Here are just a couple of easy R plots.

First, what is the relationship between the number of records we had per species before filtering and after filtering using BIEN ranges?

num_records <- lapply(gbif_bien_comparison, nrow) %>% unlist
num_records_filtered <- lapply(gbif_bien_comparison_filtered, nrow) %>% unlist
filtering_results <- cbind(num_records, num_records_filtered)
plot(filtering_results, main = "Number of Occurrence Records vs. Number of Filtered Occurrence Records", xlab = "Number of Occurrence Records", ylab = "Number of Occurrence Records After Filtering")
abline(1, 1, col = "light coral")

The above plot shows that those species with 5 000 records were well filtered: the maximum number of records post-filtered appears around 1500.

Most species required little filtering, as we can see with the data clumping near the 1:1 line. Let’s take a look at the summary statistics to learn more:

summary(filtering_results)
##   num_records      num_records_filtered
##  Min.   :   5.00   Min.   :   0.0      
##  1st Qu.:  96.75   1st Qu.:  74.0      
##  Median : 324.00   Median : 202.0      
##  Mean   : 945.93   Mean   : 389.3      
##  3rd Qu.: 899.50   3rd Qu.: 534.2      
##  Max.   :5000.00   Max.   :3056.0
difference <- num_records - num_records_filtered
summary(difference)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.00   10.50  556.67   44.25 5000.00

The median number of records removed was 10.5, which is considerably fewer than the average of 556.6683417.

This can be more clearly seen in the histogram below:

hist(difference, breaks = 20, col = "light coral", main = "Histogram of Number of Removed Records", xlab = "Number of Removed Records")

0.9 Conclusion

By utlizing resources that exist to verify distribution data, you can improve the quality of your dataset and the quality of your study’s results.